Importing the relevant Python libraries#

import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import csv
import time
import preprocessing
import random
import statsmodels

Table of contents#

  1. Introduction

  2. Datasets and preprocessing

  3. Who knows whom? - Degree distribution

  4. Connectedness (to be written in final version)

  5. Small world property

  6. Friendships in communities - Triadic closure

  7. Conclusion

  8. References

  9. Appendix

General todo: every visualisation must have a title and axis labels

1. Introduction#

Have you ever wondered how people influence each other? Or how concepts, ideas and facts spread through networks of persons? These types of questions, related to social influence, can be answered with the help of network science. The aim of this project is to use network science theory to compare properties of social networks. A network is a web that consists of nodes and edges: people, devices, accounts or other objects are represented as nodes, and the edges represent the relationships between these nodes. Together, all these nodes and edges form a network.

In real life we meet new friends through acquaintances, and it is plausible that social influence also spreads through acquaintances. Fortunately, the network datasets we use for this project also provide node attributes. For music-related networks, these attributes are often genre or artist preferences. Consequently, similarities between entities in a network can be calculated and compared.

The concept of social influence can be translated into the following hypothesis:

If two nodes are similar in the network, then they must have similar node attributes.

Or written in first order logic:

  • \(GraphSim(x,y)\): x is similar to y in the network

  • \(AttrSim(x,y)\): x and y have similar attributes

\(\forall x\forall y(GraphSim(x,y)\rightarrow AttrSim(x,y))\)

In order to test this hypothesis, we must first define what similarity is and how to compute it. For similarity in graphs there are two popular measures: the Jaccard similarity and the overlap coefficient:

\[ \textrm{jaccard}(A, B)=\frac{|A \cap B|}{|A \cup B|} \]
\[ \textrm{overlap coefficient}(A, B)=\frac{|A \cap B|}{\min(|A|, |B|)} \]

For the attribute similarity it is natural to also use the Jaccard similarity: it is a commonly used measure for categorical data, which attributes certainly are.
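Both measures can be computed directly from their definitions. A minimal sketch on made-up genre sets (illustrative values, not project data):

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: intersection over union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def overlap(a, b):
    """Overlap coefficient: intersection over the smaller set."""
    a, b = set(a), set(b)
    return len(a & b) / min(len(a), len(b))

x = {'rock', 'pop', 'jazz'}
y = {'rock', 'jazz', 'metal', 'blues'}
print(jaccard(x, y))  # 2 shared out of 5 total -> 0.4
print(overlap(x, y))  # 2 shared, smaller set has 3 -> 2/3
```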

Leaping ahead: we were able to successfully compute these similarities on a subset of the networks. By using cugraph in a Google Colab notebook we could use Google’s GPUs to compute these similarities efficiently. The code used to construct csv files with all the similarity data for the different networks can be found in the appendix. Let’s load in the files and see what they can teach us about social influence.

start = time.time()  # to measure runtime
similarities_file_paths = {
    'FacebookLargepage' : 'datasets/facebook-large-page-page-network/musae_facebook_final_similarity.csv',
    'FeatherDeezerSocial' : 'datasets/feather-deezer-social/deezer_europe_final_similarity.csv',
    'GithubSocial' : 'datasets/github-social/musae_git_final_similarity.csv',
    'TwitchSocialNetworksDE' : 'datasets/twitch-social-networks/DE/musae_DE_final_similarity.csv',
    'TwitchSocialNetworksENGB' : 'datasets/twitch-social-networks/ENGB/musae_ENGB_final_similarity.csv',
    'TwitchSocialNetworksES' : 'datasets/twitch-social-networks/ES/musae_ES_final_similarity.csv',
    'TwitchSocialNetworksFR' : 'datasets/twitch-social-networks/FR/musae_FR_final_similarity.csv',
    'TwitchSocialNetworksPTBR' : 'datasets/twitch-social-networks/PTBR/musae_PTBR_final_similarity.csv',
    'TwitchSocialNetworksRU' : 'datasets/twitch-social-networks/RU/musae_RU_final_similarity.csv',
}

For this visualisation, all the data from all the different networks is aggregated.
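The similarity scores are then converted into three equally sized categories with pd.qcut. A toy illustration of that binning (made-up scores, not project data):

```python
import pandas as pd

# Toy similarity scores; qcut splits them into terciles of (roughly) equal size
scores = pd.Series([0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
cats, bins = pd.qcut(scores, 3, retbins=True,
                     labels=["Not Similar", "Average", "Similar"])
print(cats.value_counts())  # three observations per category
print(bins)                 # the bin edges used for the split
```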

frames = []
for i in similarities_file_paths.values():
    x = pd.read_csv(i, index_col=0)
    frames.append(x)
similarity_dataframe = pd.concat(frames)
# Uses qcut to convert to categorical variables
jaccard_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['jaccard_coeff'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
overlap_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['overlap_coeff'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
jaccard_genre_coeff_cut, cut_bins = pd.qcut(similarity_dataframe['jaccard_genre_sim'], 3, retbins=True, labels=["Not Similar", "Average", "Similar"])
jaccard_coeff_dim = go.parcats.Dimension(
    values=jaccard_coeff_cut,
    categoryorder='category descending', label="Jaccard coeff"
)

overlap_coeff_dim = go.parcats.Dimension(
    values=overlap_coeff_cut,
    categoryorder='category descending', label="Overlap coeff"
)

jaccard_genre_coeff_dim = go.parcats.Dimension(
    values=jaccard_genre_coeff_cut,
    categoryorder='category descending', label="Attribute similarity"
)

sim = go.Figure(data = [go.Parcats(dimensions=[jaccard_coeff_dim, overlap_coeff_dim, jaccard_genre_coeff_dim],
                                  )])
sim.update_layout(title="Graph similarity and attribute similarity")

sim.show()
similarity_dataframe.corr()[['jaccard_coeff', 'overlap_coeff', 'jaccard_genre_sim']].loc[['jaccard_coeff', 'overlap_coeff', 'jaccard_genre_sim']]
                   jaccard_coeff  overlap_coeff  jaccard_genre_sim
jaccard_coeff           1.000000       0.646626           0.024894
overlap_coeff           0.646626       1.000000           0.001387
jaccard_genre_sim       0.024894       0.001387           1.000000

(Not) being an influencer - Are these networks even real networks?#

Although it sounds plausible that similar nodes in a graph might have similar attributes, the chart above implies that the hypothesis is invalid. This is substantiated by the Pearson correlations: the correlations between the graph similarity measures and the attribute similarity are close to zero.

This raises the question of whether the above networks are indeed real networks. To answer this question, a distinction must be made between random networks and real networks. According to Barabasi (2016), a random network consists of N nodes where each node pair is connected with probability p. A random network is also called an Erdos-Renyi network, in honor of its creators. Because a random network is constructed by chance, its characteristics are typical of that randomness. Barabasi also explains that real networks have significantly different network characteristics, since these networks are not constructed by chance.
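Such an Erdos-Renyi network is easy to generate with networkx. A quick sketch with arbitrary illustration values for N and p (not one of the project's networks):

```python
import networkx as nx
import numpy as np

# A toy Erdos-Renyi random network: N nodes, each pair linked with probability p
N, p = 1000, 0.01
G = nx.gnp_random_graph(N, p, seed=42)

degrees = [d for _, d in G.degree()]
# the expected average degree of a random network is p * (N - 1), here ~9.99
print(np.mean(degrees))
```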

The goal of this project is to analyse different social networks along different network properties, from the perspective of both random networks and real networks. With this analysis, we can answer to what extent the different networks are indeed real networks.

2. Datasets and preprocessing#

This project is based on three different datasets of social networks that can be found on Kaggle (Appendix 2). Each dataset represents a network as a list of edges. Some datasets also include a file with node attributes, often a .json file. The last dataset differs slightly from the first two: it is itself a collection of multiple social network datasets.

Before the datasets can be read in, we must manually select which datasets will and will not be used in the project. This selection is based on the size of the datasets, to reduce the risk of very long processing times. In addition, any disconnected network is excluded from the project to simplify the calculations. The table below gives an overview of all the networks, retrieved from the datasets, that will be used in this project. For each network, it lists the node representation, the link representation and, where applicable, the node attributes.

| Dataset | Node representation | Link representation | Node attributes if applicable |
| --- | --- | --- | --- |
| NashvilleMeetupNetwork | Member of a Meetup group | Shared group membership in ‘weight’ groups | N/A |
| DeezerHR | Deezer users from Croatia | The relation friendship | Genre preferences of each user |
| DeezerHU | Deezer users from Hungary | The relation friendship | Genre preferences of each user |
| DeezerRO | Deezer users from Romania | The relation friendship | Genre preferences of each user |
| FacebookLargePage | Official Facebook pages | The amount of mutual likes between pages | Descriptions of the purpose of the site |
| FeatherDeezerSocial | Deezer users from Europe | The relation friendship | Artists liked by the users |
| FeatherLastfmSocial | Lastfm users from Asia | The relation friendship | Artists liked by the users |
| GithubSocial | Developers who have starred at least 10 repositories | Mutual following between developers | Location, Repositories starred, Employer and E-mail address |
| TwitchSocialNetworks | Twitch users | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksDE | Twitch users from Germany | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksENGB | Twitch users from England | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksES | Twitch users from Spain | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksFR | Twitch users from France | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksPTBR | Twitch users from Brazil | The relation friendship | Games played and liked, Location and Streaming habits |
| TwitchSocialNetworksRU | Twitch users from Russia | The relation friendship | Games played and liked, Location and Streaming habits |
| SocEpinions1 | Epinion users | Trust relationship between users | N/A |
| SocSignSlashdot090221 | Slashdot users 2009 | Friend or enemy relationship | N/A |
| SocSlashdot0902 | Slashdot users 2009 | Friend or enemy relationship | N/A |

Preprocessing non-csv datasets#

Some networks have their edges saved in a .txt file, others in a .csv file. In order to read the edges with the pandas read_csv function, the .txt files need to be converted to .csv files. This is done with a function that can be found in the preprocessing.py file.

  • soc-sign-Slashdot090221.txt

  • soc-Slashdot0902.txt

# preprocessing.transform_text_to_csv('datasets/soc-sign-Slashdot090221.txt', 'datasets/soc-sign-Slashdot090221.csv')
# preprocessing.transform_text_to_csv('datasets/soc-Slashdot0902.txt', 'datasets/soc-Slashdot0902.csv')
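The actual conversion function lives in preprocessing.py; as an assumption about what it does, here is a minimal sketch that turns a tab-separated SNAP-style edge list into a csv, keeping the last comment line as the header (which would explain the '# FromNodeId' column name used later when building the graphs):

```python
import csv

def transform_text_to_csv(txt_path, csv_path):
    """Convert a tab-separated edge list (.txt) to a .csv file.

    Hypothetical sketch; the project's actual helper lives in preprocessing.py.
    SNAP-style files start with '#' comment lines, and the last of those is
    assumed to hold the column names.
    """
    header, rows = None, []
    with open(txt_path) as src:
        for line in src:
            line = line.rstrip('\n')
            if not line:
                continue
            if line.startswith('#'):
                header = line.split('\t')  # the last comment line wins
            else:
                rows.append(line.split('\t'))
    with open(csv_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        if header:
            writer.writerow(header)
        writer.writerows(rows)
```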

Read in datasets with pandas#

The datasets can be easily read in with pandas.

NashvilleMeetupNetwork_member_edges = pd.read_csv('datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
DeezerHR_edges = pd.read_csv('datasets/DeezerSocialNetworks/HR/HR_edges.csv')
DeezerHU_edges = pd.read_csv('datasets/DeezerSocialNetworks/HU/HU_edges.csv')
DeezerRO_edges = pd.read_csv('datasets/DeezerSocialNetworks/RO/RO_edges.csv')
FacebookLargePage_edges = pd.read_csv('datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
FeatherDeezerSocial_edges = pd.read_csv('datasets/feather-deezer-social/deezer_europe_edges.csv')
FeatherLastfmSocial_edges = pd.read_csv('datasets/feather-lastfm-social/lastfm_asia_edges.csv')
GithubSocial_edges = pd.read_csv("datasets/github-social/musae_git_edges.csv")
TwitchSocialNetworksDE_edges = pd.read_csv("datasets/twitch-social-networks/DE/musae_DE_edges.csv")
TwitchSocialNetworksENGB_edges = pd.read_csv("datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
TwitchSocialNetworksES_edges = pd.read_csv("datasets/twitch-social-networks/ES/musae_ES_edges.csv")
TwitchSocialNetworksFR_edges = pd.read_csv("datasets/twitch-social-networks/FR/musae_FR_edges.csv")
TwitchSocialNetworksPTBR_edges = pd.read_csv("datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
TwitchSocialNetworksRU_edges = pd.read_csv("datasets/twitch-social-networks/RU/musae_RU_edges.csv")
SocSignSlashdot090221_edges = pd.read_csv('datasets/soc-sign-Slashdot090221.csv')
SocSlashdot0902_edges = pd.read_csv('datasets/soc-Slashdot0902.csv')

Construct graphs with networkx#

After that, undirected networkx graphs can be constructed from the pandas dataframes.

NashvilleMeetupNetwork = nx.from_pandas_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=nx.Graph)
DeezerHR = nx.from_pandas_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerHU = nx.from_pandas_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=nx.Graph)
DeezerRO = nx.from_pandas_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=nx.Graph)
FacebookLargePage = nx.from_pandas_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=nx.Graph)
FeatherDeezerSocial = nx.from_pandas_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=nx.Graph)
FeatherLastfmSocial = nx.from_pandas_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=nx.Graph) 
GithubSocial = nx.from_pandas_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=nx.Graph)
TwitchSocialNetworksDE = nx.from_pandas_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksENGB = nx.from_pandas_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksES = nx.from_pandas_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksFR = nx.from_pandas_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksPTBR = nx.from_pandas_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=nx.Graph)
TwitchSocialNetworksRU = nx.from_pandas_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=nx.Graph)
SocSignSlashdot090221 = nx.from_pandas_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=nx.Graph)
SocSlashdot0902 = nx.from_pandas_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=nx.Graph)

Descriptive statistics of different networks#

The final preprocessing step consists of calculating different network properties for each network. The visualisation sections explain in more detail what the different properties mean.

networks = {
    'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
    'DeezerHR' : DeezerHR,
    'DeezerHU' : DeezerHU,
    'DeezerRO' : DeezerRO,
    'FacebookLargepage' : FacebookLargePage,
    'FeatherDeezerSocial' : FeatherDeezerSocial,
    'FeatherLastfmSocial' : FeatherLastfmSocial, 
    'GithubSocial' : GithubSocial,
    'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
    'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
    'TwitchSocialNetworksES' : TwitchSocialNetworksES,
    'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
    'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
    'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
    'SocSignSlashdot090221' : SocSignSlashdot090221,
    'SocSlashdot0902' : SocSlashdot0902,
}

The code below uses networkx and will take some time to run. That is why the results are written to a csv file that can be read in later.

# start = time.time()
# descriptive_stats = pd.DataFrame(index=list(networks.keys()), columns=['number_of_nodes', 'number_of_edges', 'average_degree', 'density', 'connected_network', 'avg_cc', 'transitivity', 'average_shortest_path_length', 'diameter', 'has_bridges', 'number_of_bridges', 'max_degree'])
# for name, network in networks.items():
#     descriptive_stats['number_of_nodes'].loc[name] = nx.number_of_nodes(network)
#     descriptive_stats['number_of_edges'].loc[name] = nx.number_of_edges(network)
#     descriptive_stats['average_degree'].loc[name] = np.mean(list(dict(network.degree()).values()))
#     descriptive_stats['density'].loc[name] = nx.density(network)
#     descriptive_stats['connected_network'].loc[name] = nx.is_connected(network)
#     descriptive_stats['avg_cc'].loc[name] = nx.average_clustering(network)
#     descriptive_stats['transitivity'].loc[name] = nx.transitivity(network)
#     descriptive_stats['has_bridges'].loc[name] = nx.has_bridges(network)
#     descriptive_stats['number_of_bridges'].loc[name] = len(list(nx.bridges(network)))
#     descriptive_stats['max_degree'].loc[name] = max(dict(nx.degree(network)).values())
    
# descriptive_stats.to_csv('results/descriptive_stats_of_networks_2.csv')
# end = time.time()
# print(f'time in seconds: {end - start}')

Networkx can be used to calculate many different network properties, as seen above. However, some network properties require more efficient algorithms to compute.

Below this cell is all the code that was used to calculate additional network properties. This code was run in a Google Colab notebook because this allows us to use Google’s GPUs for GPU-accelerated computing, which was needed because calculating some network properties requires a lot of processing power. Instead of pandas and networkx, the Python libraries cudf and cugraph were used: CUDA-based counterparts with similar interfaces.

# import cudf
# import cugraph as cnx
# import time
# import numpy as np
# import json
# import itertools

# # Reading in the data with cudf
# NashvilleMeetupNetwork_member_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/NashvilleMeetupNetwork/member-edges.csv', index_col=0)
# DeezerHR_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HR/HR_edges.csv')
# DeezerHU_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/HU/HU_edges.csv')
# DeezerRO_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/DeezerSocialNetworks/RO/RO_edges.csv')
# FacebookLargePage_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/facebook-large-page-page-network/musae_facebook_edges.csv')
# FeatherDeezerSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-deezer-social/deezer_europe_edges.csv')
# FeatherLastfmSocial_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/feather-lastfm-social/lastfm_asia_edges.csv')
# GithubSocial_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/github-social/musae_git_edges.csv")
# TwitchSocialNetworksDE_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/DE/musae_DE_edges.csv")
# TwitchSocialNetworksENGB_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ENGB/musae_ENGB_edges.csv")
# TwitchSocialNetworksES_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/ES/musae_ES_edges.csv")
# TwitchSocialNetworksFR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/FR/musae_FR_edges.csv")
# TwitchSocialNetworksPTBR_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/PTBR/musae_PTBR_edges.csv")
# TwitchSocialNetworksRU_edges = cudf.read_csv("/content/drive/MyDrive/Colab Notebooks/datasets/twitch-social-networks/RU/musae_RU_edges.csv")
# SocSignSlashdot090221_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-sign-Slashdot090221.csv')
# SocSlashdot0902_edges = cudf.read_csv('/content/drive/MyDrive/Colab Notebooks/datasets/soc-Slashdot0902.csv')

# # Constructing networks with cugraph
# NashvilleMeetupNetwork = cnx.from_cudf_edgelist(NashvilleMeetupNetwork_member_edges, 'member1', 'member2', edge_attr='weight', create_using=cnx.Graph)
# DeezerHR = cnx.from_cudf_edgelist(DeezerHR_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerHU = cnx.from_cudf_edgelist(DeezerHU_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# DeezerRO = cnx.from_cudf_edgelist(DeezerRO_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FacebookLargePage = cnx.from_cudf_edgelist(FacebookLargePage_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# FeatherDeezerSocial = cnx.from_cudf_edgelist(FeatherDeezerSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# FeatherLastfmSocial = cnx.from_cudf_edgelist(FeatherLastfmSocial_edges, 'node_1', 'node_2', create_using=cnx.Graph)
# GithubSocial = cnx.from_cudf_edgelist(GithubSocial_edges, 'id_1', 'id_2', create_using=cnx.Graph)
# TwitchSocialNetworksDE = cnx.from_cudf_edgelist(TwitchSocialNetworksDE_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksENGB = cnx.from_cudf_edgelist(TwitchSocialNetworksENGB_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksES = cnx.from_cudf_edgelist(TwitchSocialNetworksES_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksFR = cnx.from_cudf_edgelist(TwitchSocialNetworksFR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksPTBR = cnx.from_cudf_edgelist(TwitchSocialNetworksPTBR_edges, 'from', 'to', create_using=cnx.Graph)
# TwitchSocialNetworksRU = cnx.from_cudf_edgelist(TwitchSocialNetworksRU_edges, 'from', 'to', create_using=cnx.Graph)
# SocSignSlashdot090221 = cnx.from_cudf_edgelist(SocSignSlashdot090221_edges, '# FromNodeId', 'ToNodeId', edge_attr='Sign' , create_using=cnx.Graph)
# SocSlashdot0902 = cnx.from_cudf_edgelist(SocSlashdot0902_edges, '# FromNodeId', 'ToNodeId', create_using=cnx.Graph)

# networks = {
#     'NashvilleMeetupNetwork' : NashvilleMeetupNetwork,
#     'DeezerHR' : DeezerHR,
#     'DeezerHU' : DeezerHU,
#     'DeezerRO' : DeezerRO,
#     'FacebookLargepage' : FacebookLargePage,
#     'FeatherDeezerSocial' : FeatherDeezerSocial,
#     'FeatherLastfmSocial' : FeatherLastfmSocial,
#     'GithubSocial' : GithubSocial,
#     'TwitchSocialNetworksDE' : TwitchSocialNetworksDE,
#     'TwitchSocialNetworksENGB' : TwitchSocialNetworksENGB,
#     'TwitchSocialNetworksES' : TwitchSocialNetworksES,
#     'TwitchSocialNetworksFR' : TwitchSocialNetworksFR,
#     'TwitchSocialNetworksPTBR' : TwitchSocialNetworksPTBR,
#     'TwitchSocialNetworksRU' : TwitchSocialNetworksRU,
#     'SocSignSlashdot090221' : SocSignSlashdot090221,
#     'SocSlashdot0902' : SocSlashdot0902,
# }

# # Code used to compute the average shortest path length and diameter
# start = time.time()
# descriptive_stats = cudf.DataFrame(index=list(networks.keys()), columns=['average_shortest_path_length', 'diameter'])
# for name, network in networks.items():
#     # The maximum BFS distance from a source node is that node's eccentricity.
#     # Averaging these per-node maxima is used here as a proxy for the average
#     # shortest path length; their overall maximum is the diameter.
#     max_bfs_distances = [cnx.bfs_edges(network, source=x)['distance'].max() for x in network.nodes().to_numpy()]
#     descriptive_stats['average_shortest_path_length'].loc[name] = sum(max_bfs_distances)/len(max_bfs_distances)
#     descriptive_stats['diameter'].loc[name] = max(max_bfs_distances)
#     descriptive_stats.to_csv('/content/drive/MyDrive/Colab Notebooks/results/descriptive_stats_of_networks.csv')
# end = time.time()
# print(f'time in seconds: {end - start}')

Combining the calculations from networkx and cugraph gives the following dataframe:

descriptive_stats = pd.read_csv('results/descriptive_stats_of_networks.csv', index_col=0)
descriptive_stats
                          number_of_nodes  number_of_edges  average_degree   density  connected_network    avg_cc  transitivity  average_shortest_path_length  diameter  has_bridges  number_of_bridges  max_degree
NashvilleMeetupNetwork              11372          1176368      206.888498  0.018194               True  0.884957      0.604407                      4.413648         6         True                 13        2356
DeezerHR                            54573           498202       18.258186  0.000335               True  0.136477      0.114630                      8.501347        12         True               2435         420
DeezerHU                            47538           222887        9.377214  0.000197               True  0.116187      0.092924                      9.930603        14         True               2869         112
DeezerRO                            41773           125826        6.024274  0.000144               True  0.091212      0.075267                     12.970340        19         True               6403         112
FacebookLargepage                   22470           171002       15.220472  0.000677               True  0.359738      0.232321                     10.443525        15         True               2973         709
FeatherDeezerSocial                 28281            92752        6.559315  0.000232               True  0.141160      0.095922                     15.138963        21         True               6470         172
FeatherLastfmSocial                  7624            27806        7.294334  0.000957               True  0.219418      0.178623                     10.457240        15         True               1929         216
GithubSocial                        37700           289003       15.331724  0.000407               True  0.167537      0.012357                      7.228674        11         True               5241        9458
TwitchSocialNetworksDE               9498           153138       32.246368  0.003395               True  0.200886      0.046471                      5.085176         7         True                453        4259
TwitchSocialNetworksENGB             7126            35324        9.914117  0.001391               True  0.130928      0.042433                      7.001684        10         True               1222         720
TwitchSocialNetworksES               4648            59382       25.551635  0.005499               True  0.222496      0.084235                      6.649312         9         True                301        1022
TwitchSocialNetworksFR               6549           112666       34.407085  0.005255               True  0.221706      0.054128                      5.089327         7         True                257        2040
TwitchSocialNetworksPTBR             1912            31299       32.739540  0.017132               True  0.319895      0.130981                      4.905335         7         True                116         767
TwitchSocialNetworksRU               4385            37304       17.014367  0.003881               True  0.165797      0.048648                      6.288940         9         True                461        1229
SocSignSlashdot090221               82140           500481       12.186048  0.000148               True  0.058787      0.023618                      8.911346        13         True              30086        2548
SocSlashdot0902                     82168           582533       14.179072  0.000173               True  0.060345      0.024109                      8.910720        13         True              30035        2554

3. Who knows whom? - Degree distribution#

Degree distribution of networks.

Firstly, we will discuss everything related to the term ‘degree’ in network theory. The degree is the number of connections (or neighbours) a node has. Degrees play an important role in understanding interactions between nodes, and are fundamental in measuring the connectivity or centrality of a node in a network. For instance, a node with a high degree is probably vital in keeping the network stable and connected. In other words, nodes with a high degree have a lot of influence. A node with a significantly high degree is often referred to as a hub; a hub can act as a central point for exchanging information. Looking at real life, we see that people with more connections often have power and influence in a social network. Accordingly, a famous person usually has a very high degree, while chances are that people who live quite remotely will have a low degree.
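A minimal made-up example of degrees and a hub (illustrative, not project data):

```python
import networkx as nx

# A toy social network in which node 'c' knows everyone: a hub
G = nx.Graph([('a', 'c'), ('b', 'c'), ('d', 'c'), ('e', 'c'), ('a', 'b')])

degrees = dict(G.degree())
print(degrees['c'])                   # 4, the highest degree in this network
print(max(degrees, key=degrees.get))  # 'c' is the hub
```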

It is known that a random network has a binomial degree distribution (Barabasi, 2016). By analysing the degree distribution of the different networks, we can take the first steps towards categorizing the networks as real or random. If the degree distribution of a network is binomial, the network is better described by the random network model; if not, it is more likely to be a real network.

degree_dist = go.Figure()
for network_name, network in networks.items():
    degree_sequence = sorted((d for n, d in network.degree()), reverse=True)
    degree_dist.add_trace(go.Scatter(x=degree_sequence,
                                     y=np.arange(len(degree_sequence)),
                                     mode='lines',
                                     name=network_name))

degree_dist.update_layout(title="Degree distribution of all networks",
                          xaxis_title="Degree",
                          yaxis_title="Nodes",
                          xaxis_type="log"
                         )
degree_dist.show()

As the chart above shows, the degree distribution is far from binomial. The x-axis even had to be made logarithmic to accommodate the heavy-tailed distribution of the degrees. Thus, considering the degree distribution, all observed networks are most likely real networks.

4. Connectedness#

Short piece of text about connectedness in real networks; no visualisation needed. Still to be written for the final version.

5. Small world property#

Only a few steps are needed to reach any arbitrary node in the network.

Some definitions

If it is possible to follow links to go from one node to another node in the network, there is said to be a path between the two nodes. The path length is the number of links in a path. Between the same two nodes there can be multiple paths; the path with the fewest links is called the shortest path. The distance between two nodes is then defined as the length of the shortest path between them, also called the shortest path length.

In this project the focus is laid on networks as a whole rather than on individual nodes. For this reason, we should investigate how close or far apart we expect nodes in a network to be. To achieve this, another network property is introduced: the average shortest path length. The average shortest path length of a network is calculated by averaging the shortest path lengths across all pairs of nodes.

\(a =\sum_{\substack{s,t \in V \\ s\neq t}} \frac{d(s, t)}{n(n-1)}\)

where V is the set of nodes in G, d(s, t) is the shortest path length from s to t, and n is the number of nodes in G.

Another network property related to shortest paths is the diameter: the maximum shortest path length across all pairs of nodes.
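The definitions above can be checked on a tiny graph, for example a path of four nodes:

```python
import networkx as nx

# A path graph 0-1-2-3 to illustrate the definitions above
G = nx.path_graph(4)

# averages d(s, t) over all pairs s != t: six unordered pairs with
# distances 1, 1, 1, 2, 2, 3, so a = 10/6
a = nx.average_shortest_path_length(G)
print(a)

# the diameter is the largest shortest path length, here d(0, 3) = 3
print(nx.diameter(G))
```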

Small world property

The small world property means that only a few steps are needed to reach any arbitrary node in the network. In other words, two individuals anywhere in the world can be connected through a few acquaintances. This relationship can be expressed mathematically: according to Menczer et al. (2020), the average path length scales logarithmically with the size of the network.

\(a \approx \log(n)\)

where n is the number of nodes in G.
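As a quick sanity check of this scaling, a toy random network can be compared against the sharper estimate that the average distance grows like ln(n) / ln(⟨k⟩). The values of n and k below are arbitrary illustration choices, not one of the project's datasets:

```python
import math
import networkx as nx

# Toy check of the logarithmic scaling on a random network
n, k = 500, 10
G = nx.gnp_random_graph(n, k / (n - 1), seed=1)
# guard: keep only the largest component in case some node ended up isolated
G = G.subgraph(max(nx.connected_components(G), key=len))

a = nx.average_shortest_path_length(G)
print(a, math.log(n) / math.log(k))  # the two values land in the same ballpark
```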

This raises the question of whether this small world property holds for the networks that will be analysed.

swp = px.scatter(descriptive_stats,
                 x="number_of_nodes",
                 y="average_shortest_path_length",
                 size = "diameter",
                 trendline= "ols",
                 trendline_options=dict(log_x=True),
                 color = "diameter",
                 title="Relation between average path length and number of nodes",
                 hover_name = descriptive_stats.index)
swp.update_layout(
    xaxis = dict(
        title = 'Amount of nodes'),
    yaxis = dict(
        title = 'Average length between each node')
    )
swp.show()

According to Barabasi, the small-world property is the only property that can also be explained by the random network model. Thus, with this plot, there is no way to clearly distinguish whether the networks are real or random. However, there are two takeaways: firstly, for some networks the average path length indeed scales logarithmically with the size of the network; secondly, only a handful of networks have a short average path length compared to the number of nodes in the network. Unsurprisingly, these are also the networks with the smallest diameter.

6. Friendships in communities - Triadic closure#

Meet new friends through shared contacts.

In a social network, if A and B are both friends with C, then it is quite possible that A and B are also friends with each other: there is a good chance that a friend of my friend is also my friend. The connectivity among the neighbours of a node is important because it shows how tightly clustered the nodes are. The clustering coefficient of a node measures the fraction of its neighbours that are connected to each other. In this project, we aim to focus more on networks in their entirety rather than on individual nodes, so we chose to examine the average clustering coefficient of our networks. Networks with a low average clustering coefficient have very few triangles, and networks with a high average clustering coefficient have many. A triangle is a set of three nodes (a triad) where each pair of nodes is connected. When we meet people through shared contacts, we close triangles; the abundance of closed triangles is captured by a phenomenon called triadic closure. This phenomenon plays a crucial role in the formation and maintenance of communities within networks.
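The local and average clustering coefficients can be illustrated on a minimal example, a triangle with one pendant node:

```python
import networkx as nx

# A triangle a-b-c plus a pendant node d attached to c
G = nx.Graph([('a', 'b'), ('b', 'c'), ('a', 'c'), ('c', 'd')])

# a and b: their two neighbours are connected, so local clustering 1.0;
# c: one of its three neighbour pairs is connected, so 1/3;
# d: degree-1 nodes get clustering 0 by convention
print(nx.clustering(G))
print(nx.average_clustering(G))  # (1 + 1 + 1/3 + 0) / 4 = 7/12
```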

In conclusion, the average clustering coefficient indicates how strongly communities are knit together, while the average degree shows how dense the communities in a social network are. This raises an interesting question about the relationship between the average clustering coefficient and the average degree. Computing the correlation between these two variables over all datasets shows a strong positive correlation of 0.930455. We followed up with a figure that gives insight into what makes the correlation so strong between the various networks and their characteristics. The corresponding figure is a bubble chart using three variables: avg_cc, average_degree and number_of_nodes.

descriptive_stats['average_degree'].corr(descriptive_stats['avg_cc'])
0.9304554018401265
bubble = px.scatter(descriptive_stats, 
    x = 'average_degree', 
    y = 'avg_cc',
    size = 'number_of_nodes',
    hover_name = descriptive_stats.index,
    color = descriptive_stats.index,
    log_x = True, size_max = 60)

bubble.update_layout(
    title = 'Average clustering coefficient per average degree',
    xaxis = dict(
        title = 'Average degree',
        gridcolor = 'white',
        type = 'log',
        gridwidth = 2,
    ),
    yaxis = dict(
        title = 'Average clustering coefficient',
        gridcolor = 'white',
        gridwidth = 2,
    ),
    paper_bgcolor = 'rgb(243, 243, 243)',
    plot_bgcolor = 'rgb(243, 243, 243)',
)
bubble.show()
colors = px.colors.qualitative.Plotly[:len(networks)]
avg_cc = px.bar(descriptive_stats, x=list(networks.keys()), y='avg_cc', color=list(networks.keys()), color_discrete_sequence=colors, title='Average clustering coefficient for all networks')
avg_cc.update_layout(xaxis_title='Networks', yaxis_title='Average clustering coefficient')
avg_cc.show()

Sketch visualisation#

According to Barabasi (2016), in a real network high-degree nodes tend to have a smaller clustering coefficient than low-degree nodes. We would like to plot this for our networks; currently we can only do so for NashvilleMeetupNetwork. We chose NashvilleMeetupNetwork first because it is the network with the highest average clustering coefficient.

To minimize the loading time of the Jupyter notebook, the computations are done in advance and written to a CSV file, which can then be loaded in.

# degree_dict = dict(NashvilleMeetupNetwork.degree())
# cc_dict = dict(nx.clustering(NashvilleMeetupNetwork))
# data = {'degree': degree_dict, 'cc' : cc_dict}
# NashVilleMeetupNetwork_degree_cc_ = pd.DataFrame(data)
# NashVilleMeetupNetwork_degree_cc_.to_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv')
NashVilleMeetupNetwork_degree_cc_ = pd.read_csv('results/NashvilleMeetupNetwork_degree_vs_cc.csv', index_col=0)
fig = px.scatter(NashVilleMeetupNetwork_degree_cc_, x="degree", y="cc", title="Plotsketch: NashvilleMeetup clustering coefficient vs degree")
fig.show()
NashVilleMeetupNetwork_degree_cc_.corr()
            degree        cc
degree    1.000000 -0.645953
cc       -0.645953  1.000000

Indeed, this looks like a real network. A Pearson correlation of -0.65 further substantiates this hypothesis.
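The same degree-versus-clustering analysis can be sketched end to end on a synthetic scale-free graph (a made-up stand-in, since the Nashville data is loaded from a precomputed CSV here), by pairing each node's degree with its clustering coefficient and averaging per degree class:

```python
import networkx as nx
import pandas as pd

# Synthetic scale-free graph as a stand-in for a real social network.
G = nx.barabasi_albert_graph(500, 3, seed=42)

# One row per node: degree and local clustering coefficient.
df = pd.DataFrame({'degree': dict(G.degree()), 'cc': nx.clustering(G)})

# Average clustering coefficient per degree class, C(k).
ck = df.groupby('degree')['cc'].mean()
print(ck.head())
print(df['degree'].corr(df['cc']))
```

On such graphs the correlation is typically negative, matching Barabasi's observation that hubs sit in sparser neighbourhoods than low-degree nodes.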

7. Conclusion#

Given that real networks with the small-world property tend to have a high clustering coefficient, we conclude that NashvilleMeetupNetwork is the only network that satisfies all of the real-network properties we examined. In that sense, NashvilleMeetupNetwork is the only one of our networks that behaves like a real network.

This raises the question of what the other networks are, and why they lack high clustering coefficients. NashvilleMeetupNetwork is the only network whose edges represent real-world interactions rather than interactions in a digital environment. We therefore suspect that the way humans interact digitally differs substantially from the way they interact in real life. In the end, we are social creatures.

8. References#

Barabasi, A. L. (2016). Network Science. Cambridge: Cambridge University Press. http://networksciencebook.com/

Menczer, F., Fortunato, S., & Davis, C. (2020). A First Course in Network Science. Cambridge: Cambridge University Press. doi:10.1017/9781108653947

Appendix 1#

Code used to compute similarities in the different networks.

(The node attribute files of the different datasets are all structured differently. Given the short time available for this project and its focus on the visualisations, we did not prioritize writing code that works on every network dataset. That is why only a subset of the networks can be used for this analysis.)
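For reference, the attribute similarity itself is just a Jaccard index over two nodes' attribute sets. A minimal plain-Python sketch (the node IDs and genres below are made up for illustration, assuming attributes are stored as {node: [genres]} as in the Deezer JSON files):

```python
def attr_jaccard(attrs, u, v):
    """Jaccard similarity of the attribute sets of nodes u and v."""
    a, b = set(attrs[u]), set(attrs[v])
    if not a and not b:
        return 0.0  # convention: two empty attribute sets count as dissimilar
    return len(a & b) / len(a | b)

genres = {'u1': ['rock', 'pop', 'jazz'], 'u2': ['rock', 'jazz', 'metal']}
print(attr_jaccard(genres, 'u1', 'u2'))  # 2 shared / 4 total = 0.5
```

The GPU code below computes the same quantity, but as a structural Jaccard on a bipartite node-genre graph.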

# import json
# import cudf
# import cugraph as cnx

# def write_json_to_edge_list(json_path, edge_path):
#   with open(json_path, 'r') as file:
#     json_object = json.loads(file.read())

#   # Create empty lists for sources and targets
#   sources = []
#   targets = []

#   # Process the JSON data
#   for source, genres in json_object.items():
#     for genre in genres:
#       sources.append(source)
#       targets.append(genre)

#   # Create the cuDF DataFrame
#   df = cudf.DataFrame({'source': sources, 'target': targets})

#   df.to_csv(edge_path)
#   return

# def write_jaccard_similarity_to_csv(edge_path, jaccard_sim_path):
#   G_edges = cudf.read_csv(edge_path, index_col=0, dtype='str')
#   G_nodes = G_edges['source'].unique()
#   # print(G_nodes)
#   G = cnx.from_cudf_edgelist(G_edges, 'source', 'target', create_using=cnx.Graph)
#   Jaccard = cnx.jaccard(G)
#   print(G.number_of_nodes())
#   # display(Jaccard)
#   # J2 = Jaccard[Jaccard['first'].isin(G_nodes) & Jaccard['second'].isin(G_nodes)]
#   # display(J2)
#   # J3 = J2[J2['first'] != J2['second']]
#   # J4 = J3.sort_values(by='jaccard_coeff', ascending=False)
#   # display(J4)
#   Jaccard = Jaccard.rename(columns={'jaccard_coeff': 'jaccard_genre_sim'})
#   Jaccard.to_csv(jaccard_sim_path)
#   return

# def similarity_comparison(edge_path, jaccard_sim_path, network, final_sim_path):
#   # write_jaccard_similarity_to_csv(edge_path, jaccard_sim_path)
#   jaccard_genre_sim = cudf.read_csv(jaccard_sim_path, index_col=0)
#   # display(jaccard_genre_sim)
#   Jaccard = cnx.jaccard(network)
#   Overlap = cnx.overlap(network)
#   J_O_join = Jaccard.merge(Overlap, how='inner')
#   final = J_O_join.merge(jaccard_genre_sim, on=['first', 'second'], how='left')
#   final = final.dropna()
#   final.to_csv(final_sim_path)
#   return
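For readers without a GPU, the structural part of this computation can be reproduced with networkx's built-in `jaccard_coefficient` (a CPU-only sketch on a toy graph, not the cuGraph pipeline above):

```python
import networkx as nx

# Structural Jaccard similarity: shared neighbours / union of neighbours.
G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4)])

# Nodes 1 and 2 share neighbour 3; the union of their neighbourhoods
# is {1, 2, 3}, so the coefficient is 1/3.
for u, v, sim in nx.jaccard_coefficient(G, [(1, 2)]):
    print(u, v, sim)
```

`jaccard_coefficient` accepts an iterable of node pairs, so the same call scales to all candidate pairs of a network.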

Appendix 2#

Nashville https://www.kaggle.com/datasets/stkbailey/nashville-meetup?select=rsvps.csv

Deezer https://www.kaggle.com/datasets/andreagarritano/deezer-social-networks

Social Graphs https://www.kaggle.com/datasets/wolfram77/graphs-social

# Useful for keeping an eye on correlations during the project
descriptive_stats.corr()
number_of_nodes number_of_edges average_degree density connected_network avg_cc transitivity average_shortest_path_length diameter has_bridges number_of_bridges max_degree
number_of_nodes 1.000000 0.405829 -0.243899 -0.497133 NaN -0.438849 -0.275756 0.404563 0.434850 NaN 0.857008 0.127392
number_of_edges 0.405829 1.000000 0.771754 0.352650 NaN 0.590780 0.655455 -0.220672 -0.216164 NaN 0.327638 0.214918
average_degree -0.243899 0.771754 1.000000 0.760696 NaN 0.930455 0.886746 -0.478397 -0.495127 NaN -0.212021 0.104513
density -0.497133 0.352650 0.760696 1.000000 NaN 0.789591 0.651239 -0.631171 -0.648358 NaN -0.355927 -0.028833
connected_network NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
avg_cc -0.438849 0.590780 0.930455 0.789591 NaN 1.000000 0.946820 -0.427280 -0.446112 NaN -0.394952 0.026952
transitivity -0.275756 0.655455 0.886746 0.651239 NaN 0.946820 1.000000 -0.188503 -0.208922 NaN -0.300336 -0.159003
average_shortest_path_length 0.404563 -0.220672 -0.478397 -0.631171 NaN -0.427280 -0.188503 1.000000 0.996961 NaN 0.270928 -0.359790
diameter 0.434850 -0.216164 -0.495127 -0.648358 NaN -0.446112 -0.208922 0.996961 1.000000 NaN 0.300537 -0.320203
has_bridges NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
number_of_bridges 0.857008 0.327638 -0.212021 -0.355927 NaN -0.394952 -0.300336 0.270928 0.300537 NaN 1.000000 0.143028
max_degree 0.127392 0.214918 0.104513 -0.028833 NaN 0.026952 -0.159003 -0.359790 -0.320203 NaN 0.143028 1.000000
# only accurate when the notebook is run top to bottom (Restart & Run All)
end = time.time()
time_in_seconds = end-start
print(f"Time to run the notebook: {time_in_seconds}")
Time to run the notebook: 8.466467142105103